BUG: Fix regression in groupby.rolling.corr/cov when other is same size as each group #44824
Conversation
mroeschke commented Dec 9, 2021
- closes BUG: df.groupby().rolling().cov() does not group properly and the cartesian product of all groups is returned instead #42915
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
- whatsnew entry
nice ping on greenish
greenish
# When we have other, we must reindex (expand) the result
if other is not None and not all(
    len(group) == len(other) for group in self._grouper.indices.values()
):
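As a reading aid, here is a minimal standalone sketch of the condition this diff adds; the indices dict and other_length below are toy stand-ins for self._grouper.indices and len(other), not the real pandas internals.

```python
# Toy sketch of the guard above: only expand/reindex the result when at least
# one group's length differs from len(other).
indices = {1: [0, 1, 2, 3, 4], 2: [5, 6, 7, 8, 9]}  # stand-in for self._grouper.indices
other_length = 5                                    # stand-in for len(other)

needs_expansion = not all(len(idx) == other_length for idx in indices.values())
print(needs_expansion)  # False: every group is the same size as `other`, so no expansion
```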
how does this fix relate to the changes in the PR that caused the regression? (maybe should put that in the issue rather than respond here)
I'm not familiar with this section of the codebase, but this fix looks very specific to the reported case.
For instance, varying the OP's example in the issue slightly:
df_a[:-1].groupby(level=0).rolling(2).cov(df_b)
on 1.2.5:

                value
idx1 idx1 idx2
1    1    1       NaN
          2       0.5
          3       0.5
          4       0.5
          5       0.5
2    2    1       NaN
          2       0.5
          3       0.5
          4       0.5
on 1.3.x/master/this PR:

                value
idx1 idx1 idx2
1    1    1       NaN
          2       0.5
          3       0.5
          4       0.5
          5       0.5
     2    1       NaN
          2       NaN
          3       NaN
          4       NaN
2    1    1       NaN
          2       NaN
          3       NaN
          4       NaN
          5       NaN
     2    1       NaN
          2       0.5
          3       0.5
          4       0.5
which now seems inconsistent with the df_a.groupby(level=0).rolling(2).cov(df_b) case.
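For completeness, a self-contained reproduction of the comparison above; df_a and df_b are restated exactly as they are defined in the follow-up comment below.

```python
import pandas as pd

# Frames as defined in the follow-up comment below.
df_a = pd.DataFrame(
    {"value": range(10), "idx1": [1] * 5 + [2] * 5, "idx2": [1, 2, 3, 4, 5] * 2}
).set_index(["idx1", "idx2"])
df_b = pd.DataFrame({"value": range(5), "idx2": [1, 2, 3, 4, 5]}).set_index("idx2")

# On 1.2.5 this produces one block per group; on 1.3.x/master/this PR it
# produces the expanded result shown above.
result = df_a[:-1].groupby(level=0).rolling(2).cov(df_b)
print(result)
```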
In the OP's example, len(other) == len(all groups). In your df_a[:-1] twist, len(other) != len(all groups), and comparing to the equivalent groupby(level=0).apply(lambda x: x.rolling(2).cov(df_b)) case, the result is:
In [15]: def func(x):
    ...:     return getattr(x.rolling(2), f)(df_b)

In [16]: df_a = pd.DataFrame({"value": range(10), "idx1": [1] * 5 + [2] * 5, "idx2": [1, 2, 3, 4, 5] * 2}).set_index(["idx1", "idx2"])
    ...: df_b = pd.DataFrame({"value": range(5), "idx2": [1, 2, 3, 4, 5]}).set_index("idx2")

In [18]: f = "cov"

In [19]: df_a[:-1].groupby(level=0).apply(func)
Out[19]:
           value
idx1 idx2
1    1       NaN
     2       0.5
     3       0.5
     4       0.5
     5       0.5
2    1       NaN
     2       0.5
     3       0.5
     4       0.5
But we already have a similar unit test to the twist, window/test_groupby.py:TestRolling:test_rolling_corr_cov, where len(other) != len(all groups) and which tests against the equivalent groupby(...).apply(lambda x: x.rolling(...).cov(...)) case. The result is that other reindexes each group with other.index, which "expands the result".
Bottom line:
I do acknowledge this fix seems a bit brittle.
The behavior where len(other) != len(all groups) is a bit opaque to me, because comparing to the equivalent groupby.apply(lambda x: x.rolling(...)) expression doesn't reveal a discernible pattern to me.
I agree that comparison to apply might not be appropriate - in general the built-in op should produce results that are potentially better than apply, because apply has to be agnostic to the operation.
That said, the current output on master from the example in #44824 (comment) doesn't make sense to me. It appears to me that pandas is creating a record with (idx1, idx1, ...) being (1, 2, ...) and then declaring the value to be null because the record (that pandas created) is inconsistent.
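Using df_a and df_b from the session above, the rows being described can be selected directly (the all-NaN values are those shown in the master output quoted earlier):

```python
# Assumes df_a / df_b from the IPython session above. On 1.3.x/master, the rows
# whose group key and original idx1 disagree, e.g. (idx1=1, idx1=2, ...), exist
# in the result but are entirely NaN.
result = df_a[:-1].groupby(level=0).rolling(2).cov(df_b)
print(result.loc[(1, 2)])  # all NaN on master, per the output quoted earlier
```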
That seems to make sense to me since the second level was never involved in the groupby.
If you have a chance, could you evaluate whether this test's behavior seems correct? It currently tries to match the equivalent groupby.apply version, but I am not sure if there is a more sensible result.
pandas/pandas/tests/window/test_groupby.py, line 151 in 193ca73:
def test_rolling_corr_cov_other_diff_size_as_groups(self, f):
What would the expected behavior of frame1.cov(other=frame2) be? Currently this argument doesn't exist, but if it did, I think the expectation would be to first align frame2 to frame1's index and then compute the covariance.
Under this assumption, I think a more sensible result in the test linked to above would be to do the same alignment for each window during the rolling operation. This also has the benefit of making frame.groupby("A").rolling(window=2).cov(other=frame) the same as frame.groupby("A").rolling(window=2).cov().
However if the answer to the first question is to have null values for rows in frame2 that don't exist in frame1, then the current behavior of the test above is the most sensible.
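A minimal sketch of the "align first, then compute" semantics described here, assuming a hypothetical cov_with_other helper (DataFrame.cov has no other argument today) and frames that share column names:

```python
import pandas as pd

def cov_with_other(frame1: pd.DataFrame, frame2: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: align frame2 to frame1's index, then take the
    column-wise covariance. Assumes the frames share column names."""
    aligned = frame2.reindex(frame1.index)
    return frame1.apply(lambda col: col.cov(aligned[col.name]))
```

Under this reading, each rolling window in frame.groupby("A").rolling(window=2).cov(other=frame) would align other to that window's index, which is what makes it collapse to the plain .cov() case.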
Thanks. I agree that it probably makes sense for index alignment to happen between the group and other before computing the covariance.
thanks @mroeschke
@meeseeksdev backport 1.3.x
…corr/cov when other is same size as each group
Something went wrong ... Please have a look at my logs.
…hen other is same size as each group (#44848) Co-authored-by: Matthew Roeschke <[email protected]>